library(readxl)     # to read in an .xlsx file
library(ggrepel)    # to help label residual plots
library(ggdist)     # new package not previously used
library(broom)
library(kableExtra)
library(Hmisc)
library(mosaic)
library(janitor)
library(naniar)
library(patchwork)
library(tidyverse)

theme_set(theme_bw())
options(tidyverse.quiet = TRUE)
options(dplyr.summarise.inform = FALSE)
One-Factor Analysis of Variance: Comparing Multiple Means with Independent Samples
Today’s Data (ohio_2020)
ohio_2020.xlsx rows describe Ohio’s 88 counties:
FIPS code (identifier for mapping), state and county name
health outcomes (standardized: more positive means better outcomes, because we’ve taken the negative of the Z score CHR provides)
health behavior ranking (1-88, we’ll divide into 4 groups)
clinical care ranking (1-88, we’ll split into 3 groups)
proportion of county residents who live in rural areas
median income, in dollars
proportion of votes in the 2016 Presidential Election for Donald Trump
Counties in the Best group had the best behavior results.
Key Measure Details
clin_care = (Strong/Middle/Weak) reflects rates of uninsured, care providers, preventable hospital stays, diabetes monitoring and mammography screening.
Strong means that clinical care is strong in this county.
Today’s First Question
How do average health outcomes vary across groups of counties defined by health behavior?
K Samples: Comparing Means
What is the outcome under study?
What are the (in this case, \(K \geq 2\)) treatment/exposure groups?
Were the data in fact collected using independent samples?
Are the data random samples from the population(s) of interest? Or is there at least a reasonable argument for generalizing from the samples to the population(s)?
What is the significance level (or, the confidence level) we require?
Are we doing one-sided or two-sided testing? (usually 2-sided)
What does the distribution of each individual sample tell us about which inferential procedure to use?
Are there statistically detectable differences between population means?
If an overall test rejects the null, can we identify pairwise comparisons of means that show detectable differences using an appropriate procedure that protects against Type I error expansion due to multiple comparisons?
Question 1
Do average health outcomes differ by health behavior?
ggplot(ohio20, aes(x = behavior, y = outcomes, fill = behavior)) +
  geom_violin(alpha = 0.25) +
  geom_boxplot(width = 0.25) +
  guides(fill = "none") +
  scale_fill_brewer(palette = "Spectral", direction = -1) +
  labs(x = "Health Behavior Group",
       y = "Health Outcomes (higher = better health)",
       title = "Health Outcomes across Behavior Groups",
       subtitle = "Ohio's 88 counties, 2020 County Health Rankings",
       caption = "Source: https://www.countyhealthrankings.org/app/ohio/2020/downloads")
Question 1
Question 1 Raindrop Plots?
ggplot(ohio20, aes(x = behavior, y = outcomes, fill = behavior)) +
  ggdist::stat_halfeye(adjust = 0.5, width = 0.3, .width = c(0.5, 1)) +
  ggdist::stat_dots(side = "left", dotsize = 1,
                    justification = 1.05, binwidth = 0.1) +
  guides(fill = "none") +
  scale_fill_brewer(palette = "Spectral", direction = -1) +
  labs(x = "Health Behavior Group",
       y = "Health Outcomes (higher = better health)",
       title = "Health Outcomes across Behavior Groups",
       subtitle = "Ohio's 88 counties, 2020 County Health Rankings",
       caption = "Source: https://www.countyhealthrankings.org/app/ohio/2020/downloads")
Question 1 Raindrop Plots?
Question 1 Numerical Summaries
How do average health outcomes vary across groups of counties defined by health behavior?
Df Sum Sq Mean Sq F value Pr(>F)
behavior 3 46.42 15.474 57.72 <2e-16 ***
Residuals 84 22.52 0.268
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So, what’s the conclusion? Is this a surprise?
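A minimal sketch of the call that would produce an ANOVA table like the one above, assuming the model is fit with lm() under the name model_one (the name used later in these slides):

```r
# Sketch (assumed): one-way ANOVA of outcomes by behavior group
model_one <- lm(outcomes ~ behavior, data = ohio20)
anova(model_one)
```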
Multiple Comparisons
What’s Left? (Multiple Comparisons)
If an overall test rejects the null, can we identify pairwise comparisons of means that show detectable differences using an appropriate procedure that protects against Type I error expansion due to multiple comparisons?
Yes.
Two Methods for Multiple Comparisons
There are two methods we’ll study to identify specific pairs of means where we have statistically detectable differences, while dealing with the problem of multiple comparisons.
Pairwise comparisons using t tests with pooled SD
data: ohio20$outcomes and ohio20$behavior
Best High Low
High 1.8e-05 - -
Low 1.4e-10 0.007 -
Worst < 2e-16 1.6e-12 3.6e-07
P value adjustment method: none
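The matrix above is the kind of output produced by base R's pairwise.t.test(), which pools the standard deviation across groups by default. A sketch of the likely call, with no adjustment for multiple comparisons:

```r
# Sketch (assumed call): unadjusted pairwise t tests with pooled SD
pairwise.t.test(ohio20$outcomes, ohio20$behavior,
                p.adjust.method = "none")
```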
The problem of Multiple Comparisons
The more comparisons you do simultaneously, the more likely you are to make an error.
In the worst case scenario, suppose you do two tests - first A vs. B and then A vs. C, each at the \(\alpha = 0.10\) level.
What is the combined error rate across those two t tests?
The problem of Multiple Comparisons
Run the first test. Make a Type I error 10% of the time.
A vs. B Type I error?   Probability
Yes                     0.1
No                      0.9
Now, run the second test. Assume (perhaps wrongly) that comparing A to C is independent of your A-B test result. What is the error rate now?
The problem of Multiple Comparisons
Assuming there is a 10% chance of making an error in either test, independently …
                              Error in A vs. C   No Error   Total
Type I error in A vs. B       0.01               0.09       0.10
No Type I error in A vs. B    0.09               0.81       0.90
Total                         0.10               0.90       1.00
So you will make an error in the A-B or A-C comparison 19% of the time, rather than the nominal \(\alpha = 0.10\) error rate.
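The 19% figure follows directly from the table: as a quick arithmetic check,

```r
# Combined error rate across two independent tests, each run at alpha = 0.10:
# P(at least one Type I error) = 1 - P(no error in either test)
1 - (1 - 0.10)^2   # = 0.19
```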
But we’re building SIX tests
Best vs. High
Best vs. Low
Best vs. Worst
High vs. Low
High vs. Worst
Low vs. Worst
and if they were independent, and each done at a 5% error rate, we could still wind up with an overall error rate as high as \(1 - (1 - 0.05)^6 \approx 0.265\), or about 26.5%.
If we do 6 tests, we could reduce the necessary \(\alpha\) to 0.05 / 6 = 0.0083 and that maintains an error rate no higher than \(\alpha = 0.05\) across the 6 tests.
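The two quantities above are quick to verify:

```r
alpha <- 0.05
m <- 6                  # number of pairwise tests

1 - (1 - alpha)^m       # worst-case family-wise error rate, about 0.265
alpha / m               # Bonferroni-adjusted per-test threshold, about 0.0083
```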
Pairwise comparisons using t tests with pooled SD
data: ohio20$outcomes and ohio20$behavior
Best High Low
High 0.00011 - -
Low 8.3e-10 0.04224 -
Worst < 2e-16 9.4e-12 2.1e-06
P value adjustment method: bonferroni
We still detect a meaningful difference between each pair of groups.
Better Approach: Holm-Bonferroni
Suppose you have \(m\) comparisons, with p-values sorted from low to high as \(p_1\), \(p_2\), …, \(p_m\).
Is \(p_1 < \alpha/m\)? If so, reject the null hypothesis associated with \(p_1\) and continue; otherwise STOP.
Is \(p_2 < \alpha/(m-1)\)? If so, reject the null hypothesis associated with \(p_2\) and continue; else STOP.
and so on…
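The step-down procedure above is what base R's p.adjust() implements with method = "holm": each adjusted p-value can then be compared directly to \(\alpha\). The p-values below are made up for illustration:

```r
# Hypothetical sorted p-values from m = 4 comparisons
p <- c(0.001, 0.010, 0.020, 0.050)

# Holm adjustment: compare each adjusted value to alpha directly
p.adjust(p, method = "holm")   # 0.004 0.030 0.040 0.050
```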
Holm-Bonferroni Approach
This is uniformly more powerful than Bonferroni, while preserving the overall false positive rate at \(\alpha\).
Pairwise comparisons using t tests with pooled SD
data: ohio20$outcomes and ohio20$behavior
Best High Low
High 3.5e-05 - -
Low 5.5e-10 0.007 -
Worst < 2e-16 7.9e-12 1.1e-06
P value adjustment method: holm
Tukey’s Honestly Significant Differences
Tukey’s HSD approach is a better choice for pre-planned comparisons with a balanced (or nearly balanced) design. It provides confidence intervals and an adjusted p value for each comparison.
Let’s run some confidence intervals to yield an overall 99% confidence level, even with 6 tests…
TukeyHSD(aov(lm(outcomes ~ behavior, data = ohio20)),
         conf.level = 0.99, ordered = TRUE)
Tukey’s Honestly Significant Differences
Tukey multiple comparisons of means
99% family-wise confidence level
factor levels have been ordered
Fit: aov(formula = lm(outcomes ~ behavior, data = ohio20))
$behavior
diff lwr upr p adj
Low-Worst 0.8632211 0.36223069 1.3642115 0.0000021
High-Worst 1.2945256 0.79353515 1.7955159 0.0000000
Best-Worst 2.0056105 1.50462011 2.5066009 0.0000000
High-Low 0.4313045 -0.06968593 0.9322949 0.0348350
Best-Low 1.1423894 0.64139903 1.6433798 0.0000000
Best-High 0.7110850 0.21009456 1.2120753 0.0001023
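The plotting code that follows uses an object called tukey_one. A sketch of how it was presumably built, by tidying the Tukey HSD results with broom::tidy() to get the contrast, estimate, conf.low, and conf.high columns the plot expects:

```r
# Sketch (assumed): tidy the Tukey HSD output into a data frame for plotting
tukey_one <- tidy(TukeyHSD(aov(lm(outcomes ~ behavior, data = ohio20)),
                           conf.level = 0.99, ordered = TRUE))
```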
ggplot(tukey_one, aes(x = reorder(contrast, -estimate), y = estimate)) +
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
  geom_hline(yintercept = 0, col = "red", linetype = "dashed") +
  geom_text(aes(label = round(estimate, 2)), nudge_x = -0.2) +
  labs(x = "Contrast between Behavior Groups",
       y = "Estimated Effect, with 99% Tukey HSD interval",
       title = "Estimated Effects, with Tukey HSD 99% Confidence Intervals",
       subtitle = "Comparing Outcomes by Behavior Group, ohio20 data")
Plot Tukey HSD intervals
ANOVA Assumptions
The assumptions behind analysis of variance are those of a linear model. Of specific interest are:
The samples obtained from each group are independent.
Ideally, the samples from each group are a random sample from the population described by that group.
In the population, the variance of the outcome in each group is equal. (This is less of an issue if our study involves a balanced design.)
In the population, we have Normal distributions of the outcome in each group.
Happily, the ANOVA F test is fairly robust to violations of the Normality assumption.
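One way to probe the equal-variance assumption (not shown in the original slides, so this is a suggested check rather than the deck's approach) is Bartlett's test, available in base R:

```r
# Sketch (assumption: a reasonable check, not part of the original deck):
# Bartlett's test of homogeneity of variances across behavior groups
bartlett.test(outcomes ~ behavior, data = ohio20)
```

Note that Bartlett's test is itself sensitive to non-Normality, so it is best read alongside the residual plots that follow.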
Residual Plots for model_one
aug_one <- augment(model_one, ohio20)

p1 <- ggplot(aug_one, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
              lty = "dashed", col = "red") +
  geom_text_repel(data = aug_one |> slice_max(abs(.resid), n = 3),
                  aes(label = county)) +
  labs(title = "model_one Residuals vs. Fitted",
       x = "Fitted Value from model_one",
       y = "Residuals from model_one")

p2 <- ggplot(aug_one, aes(sample = .resid)) +
  geom_qq() +
  geom_qq_line(col = "red") +
  labs(title = "model_one Residuals", y = "")

p3 <- ggplot(aug_one, aes(y = .resid, x = "")) +
  geom_violin(fill = "aquamarine") +
  geom_boxplot(width = 0.5) +
  labs(y = "", x = "")

p1 + p2 + p3 + plot_layout(widths = c(5, 4, 1))
Residual Plots for model_one
Can we avoid assuming equal population variances?
Yes, but the results differ little from the usual ANOVA F test when the design is balanced (or nearly so).
oneway.test(outcomes ~ behavior, data = ohio20)
One-way analysis of means (not assuming equal variances)
data: outcomes and behavior
F = 43.145, num df = 3.000, denom df = 45.494, p-value = 2.349e-13
Note that this approach uses a fractional degrees of freedom calculation in the denominator.
The Kruskal-Wallis Test
If you thought the data were severely skewed, you might try:
kruskal.test(outcomes ~ behavior, data = ohio20)
Kruskal-Wallis rank sum test
data: outcomes by behavior
Kruskal-Wallis chi-squared = 61.596, df = 3, p-value = 2.681e-13
\(H_0\): The four behavior groups have the same center to their outcomes distributions.
\(H_A\): At least one group has a shifted distribution, with a different center to its outcomes.
What would be the conclusion here?
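If the Kruskal-Wallis test rejects its null, a natural rank-based follow-up (an assumption on my part; the deck does not show one) is pairwise Wilcoxon rank sum tests with a Holm adjustment for multiple comparisons:

```r
# Sketch (suggested follow-up, not shown in the original slides):
# pairwise rank-based comparisons with Holm's adjustment
pairwise.wilcox.test(ohio20$outcomes, ohio20$behavior,
                     p.adjust.method = "holm")
```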
K Samples: Comparing Means
What is the outcome under study?
What are the (in this case, \(K \geq 2\)) treatment/exposure groups?
Were the data in fact collected using independent samples?
Are the data random samples from the population(s) of interest? Or is there at least a reasonable argument for generalizing from the samples to the population(s)?
What is the significance level (or, the confidence level) we require?
Are we doing one-sided or two-sided testing? (usually 2-sided)
What does the distribution of each individual sample tell us about which inferential procedure to use?
Are there statistically detectable differences between population means?
If an overall test rejects the null, can we identify pairwise comparisons of means that show detectable differences using an appropriate procedure that protects against Type I error expansion due to multiple comparisons?
Two-Factor Analysis of Variance
A Two-Factor Example
Suppose we want to understand the impacts of two factors on our standardized health outcomes simultaneously:
health behavior ranking (divided into 4 groups)
clinical care ranking (divided into 3 groups)
and it is possible that the impact of the health behavior ranking on our outcome measure may depend on the clinical care ranking, and vice versa (i.e., the factors may interact).
model_two <- lm(outcomes ~ clin_care, data = ohio20)
anova(model_two)
Analysis of Variance Table
Response: outcomes
Df Sum Sq Mean Sq F value Pr(>F)
clin_care 2 7.232 3.6159 4.9807 0.009007 **
Residuals 85 61.708 0.7260
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual Plots for model_two
aug_two <- augment(model_two, ohio20)

p1 <- ggplot(aug_two, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
              lty = "dashed", col = "red") +
  geom_text_repel(data = aug_two |> slice_max(abs(.resid), n = 3),
                  aes(label = county)) +
  labs(title = "model_two Residuals vs. Fitted",
       x = "Fitted Value from model_two",
       y = "Residuals from model_two")

p2 <- ggplot(aug_two, aes(sample = .resid)) +
  geom_qq() +
  geom_qq_line(col = "red") +
  labs(title = "model_two Residuals", y = "")

p3 <- ggplot(aug_two, aes(y = .resid, x = "")) +
  geom_violin(fill = "aquamarine") +
  geom_boxplot(width = 0.5) +
  labs(y = "", x = "")

p1 + p2 + p3 + plot_layout(widths = c(5, 4, 1))
Residual Plots for model_two
Question 2 Kruskal-Wallis test
kruskal.test(outcomes ~ clin_care, data = ohio20)
Kruskal-Wallis rank sum test
data: outcomes by clin_care
Kruskal-Wallis chi-squared = 8.3139, df = 2, p-value = 0.01566
K Samples: Comparing Means
What is the outcome under study?
What are the (in this case, \(K \geq 2\)) treatment/exposure groups?
Were the data in fact collected using independent samples?
Are the data random samples from the population(s) of interest? Or is there at least a reasonable argument for generalizing from the samples to the population(s)?
What is the significance level (or, the confidence level) we require?
Are we doing one-sided or two-sided testing? (usually 2-sided)
What does the distribution of each individual sample tell us about which inferential procedure to use?
Are there statistically detectable differences between population means?
If an overall test rejects the null, can we identify pairwise comparisons of means that show detectable differences using an appropriate procedure that protects against Type I error expansion due to multiple comparisons?
Did Former President Trump’s vote percentage in 2016 vary meaningfully across groups of counties defined by educational attainment?
Trump 2016 % by Educational Attainment
ggplot(ohio20, aes(x = educ, y = trump16, fill = educ)) +
  geom_violin(alpha = 0.25) +
  geom_boxplot(width = 0.25) +
  guides(fill = "none") +
  scale_fill_brewer(palette = "Spectral", direction = -1) +
  labs(x = "Education Group (2020 County Health Rankings)",
       y = "Proportion of Vote for Trump in 2016 Election",
       title = "Proportion of Trump Vote by 'Some College' Group",
       subtitle = "Ohio's 88 counties")
Did Former President Trump’s vote percentage in 2016 vary meaningfully across income?
Trump 2016 % by Income
ggplot(ohio20, aes(x = income, y = trump16, fill = income)) +
  geom_violin(alpha = 0.25) +
  geom_boxplot(width = 0.25) +
  guides(fill = "none") +
  scale_fill_brewer(palette = "Spectral", direction = -1) +
  labs(x = "Income Group (2020 County Health Rankings)",
       y = "Proportion of Vote for Trump in 2016 Election",
       title = "Proportion of Trump Vote by Income Group",
       subtitle = "Ohio's 88 counties")
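The next plotting chunk uses an object called tukey_4. A sketch of how it was presumably built, mirroring the earlier behavior-group analysis but comparing trump16 across income groups with 90% Tukey HSD intervals:

```r
# Sketch (assumed): tidied 90% Tukey HSD comparisons of trump16 by income group
tukey_4 <- tidy(TukeyHSD(aov(lm(trump16 ~ income, data = ohio20)),
                         conf.level = 0.90, ordered = TRUE))
```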
ggplot(tukey_4, aes(x = reorder(contrast, -estimate), y = estimate)) +
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
  geom_hline(yintercept = 0, col = "red", linetype = "dashed") +
  geom_label(aes(label = round_half_up(estimate, 2))) +
  coord_flip() +
  labs(x = "Contrast between Income Groups",
       y = "Estimated Effect, with 90% Tukey HSD interval",
       title = "Estimated Effects, with Tukey HSD 90% Confidence Intervals",
       subtitle = "Comparing Trump16 Vote % by Income Group, ohio20 data")
Plotting Tukey HSD intervals (Income Groups)
K Samples: Comparing Means
What is the outcome under study?
What are the (in this case, \(K \geq 2\)) treatment/exposure groups?
Were the data in fact collected using independent samples?
Are the data random samples from the population(s) of interest? Or is there at least a reasonable argument for generalizing from the samples to the population(s)?
What is the significance level (or, the confidence level) we require?
Are we doing one-sided or two-sided testing? (usually 2-sided)
What does the distribution of each individual sample tell us about which inferential procedure to use?
Are there statistically detectable differences between population means?
If an overall test rejects the null, can we identify pairwise comparisons of means that show detectable differences using an appropriate procedure that protects against Type I error expansion due to multiple comparisons?
New Two-Way ANOVA Example
Suppose we’re interested in the mean trump16 and how it might be influenced by both the income and the educ groups.
model_5noint <- lm(trump16 ~ income + educ, data = ohio20)
model_5int <- lm(trump16 ~ income * educ, data = ohio20)
anova(model_5int)
Analysis of Variance Table
Response: trump16
Df Sum Sq Mean Sq F value Pr(>F)
income 3 48.8 16.27 0.3395 0.796813
educ 4 5114.7 1278.68 26.6792 1.596e-13 ***
income:educ 9 1200.2 133.35 2.7824 0.007511 **
Residuals 71 3402.9 47.93
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# A tibble: 17 × 3
# Groups: income [4]
income educ mean_trump
<fct> <fct> <dbl>
1 Lowest Least 67.9
2 Lowest Low 68.5
3 Lowest Middle 69.7
4 Lowest High 41.2
5 Low Least 73.0
6 Low Low 68.6
7 Low Middle 64.5
8 Low High 58.3
9 Low Most 39.2
10 High Least 69.8
11 High Low 69.1
12 High Middle 70.9
13 High High 60.7
14 High Most 44.3
15 Highest Least 73.4
16 Highest High 68.3
17 Highest Most 61.9
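The cell-means table above (and the out_summary5 object used in the interaction plot below) was presumably built with a grouped summary along these lines:

```r
# Sketch (assumed): mean trump16 vote % within each income-educ combination
out_summary5 <- ohio20 |>
  group_by(income, educ) |>
  summarise(mean_trump = mean(trump16))
```

With 4 income and 5 educ levels we would expect up to 20 rows; the 17 shown suggest that three income-educ combinations contain no counties.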
Now, build the interaction plot.
Looking for a substantial interaction (non-parallel lines)
ggplot(out_summary5, aes(x = income, y = mean_trump)) +
  geom_line(aes(group = educ, color = educ)) +
  geom_point(aes(color = educ))
Now, build the interaction plot.
Two-Way ANOVA without Interaction
anova(model_5noint)
Analysis of Variance Table
Response: trump16
Df Sum Sq Mean Sq F value Pr(>F)
income 3 48.8 16.27 0.2828 0.8377
educ 4 5114.7 1278.68 22.2231 2.305e-12 ***
Residuals 80 4603.1 57.54
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow = c(1, 2))
plot(model_5int, which = 1:2)
par(mfrow = c(1, 1))
What conclusions should we draw?
The interaction term accounts for about 12% of the total variation in our outcome, and for nearly 20% of the variation our model explains.
The interaction plot suggests substantial interaction between the factors (lines are not at all parallel.)
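Those percentages follow from the sums of squares in the two-way ANOVA table (income 48.8, educ 5114.7, income:educ 1200.2, residuals 3402.9):

```r
# Sums of squares from the anova(model_5int) table shown earlier
ss <- c(income = 48.8, educ = 5114.7, interaction = 1200.2, residual = 3402.9)

ss["interaction"] / sum(ss)        # share of total variation, about 0.123
ss["interaction"] / sum(ss[1:3])   # share of explained variation, about 0.189
```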